In [1]:
In [2]:
In [3]:

I - Read Data

In [4]:
Out[4]:
id user_id created_at cooked post_number updated_at reply_count reply_to_post_number quote_count incoming_link_count ... cleaned_text lemmat_text token_text no_sw_text noHTML_text lemma_LDA_text no_sw_LDA_text token_NN_text creation_date creation_year
0 6997.0 884 2016-03-14 17:40:17.744000+00:00 <p>The size of the mask plus electronics is ra... 3.0 2016-03-14 17:40:42.276000+00:00 1.0 2.0 0.0 10.0 ... mask plus electronics bulky foam material unde... mask plus electronics bulky foam material unde... ['mask', 'plus', 'electronics', 'bulky', 'foam... mask plus electronics bulky foam material unde... The size of the mask plus electronics is rath... mask electronics material mask sweat issue str... mask electronics material mask sweat issue str... ['mask', 'electronics', 'material', 'mask', 's... 2016-03-14 17:40:17.744000+00:00 2016
1 3798.0 884 2014-02-24 22:58:11+00:00 <aside class="quote no-group quote-modified" d... 4.0 2015-10-21 04:24:48.934000+00:00 0.0 NaN 1.0 0.0 ... agaricus restedness point scale knowing middle... agaricus restedness point scale know middle co... ['agaricus', 'restedness', 'point', 'scale', '... agaricus restedness point scale middle cover l... \n \n \n Agaricus: \n \n For “restedness” I... agaricus restedness point scale cover range fo... agaricus restedness point scale cover range fo... ['agaricus', 'restedness', 'point', 'scale', '... 2014-02-24 22:58:11+00:00 2014
2 8586.0 884 2016-11-05 13:46:33.686000+00:00 <p>The <a href="https://biostrap.com">Biostrap... 15.0 2016-11-05 13:47:19.301000+00:00 1.0 NaN 0.0 20.0 ... biostrap looks promising shipping december cla... biostrap look promise ship december claim offe... ['biostrap', 'look', 'promise', 'ship', 'decem... biostrap promise ship december claim offer hr ... The Biostrap looks promising (and is shippi... biostrap claim offer hr ppg sensor hrv rest re... biostrap claim offer hr ppg sensor hrv rest re... ['biostrap', 'claim', 'offer', 'hr', 'ppg', 's... 2016-11-05 13:46:33.686000+00:00 2016
3 7124.0 884 2016-04-14 13:50:17.159000+00:00 <p>I’ve <a href="http://www.quantifiedbob.com/... 1.0 2016-04-14 13:50:17.159000+00:00 0.0 NaN 0.0 878.0 ... posted lengthy writeup recent experience repli... post lengthy writeup recent experience replica... ['post', 'lengthy', 'writeup', 'recent', 'expe... post lengthy writeup recent experience replica... I’ve posted a lengthy writeup of my recent e... experience diet study cell metabolism institut... experience diet study cell metabolism institut... ['experience', 'diet', 'study', 'cell', 'metab... 2016-04-14 13:50:17.159000+00:00 2016
4 8707.0 884 2016-11-15 18:36:40.786000+00:00 <p>Personally, I want to be able to access tim... 6.0 2016-11-15 18:36:40.786000+00:00 1.0 NaN 0.0 4.0 ... personally able access series data breath temp... personally able access series data breath temp... ['personally', 'able', 'access', 'series', 'da... personally able access series data breath temp... Personally, I want to be able to access time ... access series data breath tempo second second ... access series data breath tempo second second ... ['access', 'series', 'data', 'breath', 'tempo'... 2016-11-15 18:36:40.786000+00:00 2016

5 rows × 49 columns

In [5]:
Out[5]:
(10282, 49)
In [6]:
In [7]:
Out[7]:
(10282, 49)
In [8]:
Out[8]:
(10282, 49)

II - Exploratory Data Analysis

1 - Preprocessing Topic/Post Text

  • Data Wrangling
  • Text cleaning
  • Text normalization (Lemmatization)
  • Text Tokenization / Remove stop words
In [9]:

Clean Text

In [10]:
In [11]:
In [12]:
In [13]:
In [14]:
Out[14]:
cooked cleaned_text
0 <p>The size of the mask plus electronics is rather bulky and the foam material on the underside of the mask sometimes makes my forehead sweat so I’m guessing the slippage issue/waking me up is being caused by this.</p>\n<p>The mask has an elastic strap, but unfortunately it’s non-adjustable. I’m... the size of the mask plus electronics is rather bulky and the foam material on the underside of the mask sometimes makes my forehead sweat so i guessing the slippage issue waking me up is being caused by this the mask has an elastic strap but unfortunately it non adjustable i thinking about cut...
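The cleaning cell itself isn't echoed above; a minimal sketch that reproduces the visible `cooked` → `cleaned_text` transformation (strip HTML tags, lowercase, keep letters only, collapse whitespace). The function name and exact rules are illustrative, not the notebook's actual code:

```python
import re
import html

def clean_text(raw: str) -> str:
    """Crude cleaner mirroring the output above: strip HTML tags,
    lowercase, replace non-letters with spaces, collapse whitespace."""
    text = html.unescape(raw)
    text = re.sub(r"<[^>]+>", " ", text)   # drop HTML tags
    text = text.lower()
    text = re.sub(r"[^a-z]+", " ", text)   # keep letters only
    return re.sub(r"\s+", " ", text).strip()

print(clean_text("<p>The size of the mask plus electronics is rather bulky.</p>"))
# → the size of the mask plus electronics is rather bulky
```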

Clean StopWords

In [15]:
In [16]:
[nltk_data] Error loading stopwords: <urlopen error [Errno -3]
[nltk_data]     Temporary failure in name resolution>
Out[16]:
False
In [17]:
In [18]:
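The NLTK stopword download failed in this environment (hence Out[16] is False), so a cached copy was presumably used. A dependency-free sketch of the removal step, with a tiny inline set standing in for `nltk.corpus.stopwords.words('english')` (the real list has roughly 180 entries):

```python
# Tiny stand-in for nltk.corpus.stopwords.words("english")
STOPWORDS = {"the", "of", "is", "and", "a", "an", "so", "i", "my",
             "by", "this", "it", "on", "to", "in", "has", "but"}

def remove_stopwords(text: str) -> str:
    # Keep only tokens that are not in the stopword set
    return " ".join(w for w in text.split() if w not in STOPWORDS)

print(remove_stopwords("the size of the mask plus electronics is rather bulky"))
# → size mask plus electronics rather bulky
```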

Lemmatize Words

In [19]:
In [20]:
[nltk_data] Error loading punkt: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>
[nltk_data] Error loading wordnet: <urlopen error [Errno -3] Temporary
[nltk_data]     failure in name resolution>
Out[20]:
False
In [21]:
In [22]:
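Lemmatization here relies on NLTK's WordNetLemmatizer (whose wordnet download also failed above). As a stand-in that needs no corpus, a toy suffix-stripping lemmatizer; the rules are purely illustrative and far cruder than WordNet:

```python
def crude_lemmatize(word: str) -> str:
    """Toy stand-in for nltk.stem.WordNetLemmatizer; real lemmatization
    needs the WordNet corpus. These suffix rules are illustrative only."""
    for suffix, repl in (("ing", ""), ("ied", "y"), ("ies", "y"),
                         ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

print([crude_lemmatize(w) for w in ["knowing", "looks", "claimed"]])
# → ['know', 'look', 'claim']
```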

Create Lemmatized Text (Nouns Only) for the LDA Model

In [23]:
In [24]:
In [25]:

Tokenize words

In [26]:
In [27]:
In [28]:
Out[28]:
lemmat_text lemma_LDA_text
0 mask plus electronics bulky foam material underside mask forehead sweat guess slippage issue wake cause mask elastic strap adjustable cut reattaching stay snug mask electronics material mask sweat issue strap stay snug
1 agaricus restedness point scale middle cover large range allow focus extreme gary point pick scale data constantly establish threshold value example pain wise value compare establish value mean relation large set data original measure adjust begin point scale convert point scale double value ear... agaricus restedness point scale cover range focus extreme point pick scale data threshold value example wise value value relation data measure point scale convert point scale value measure granularity
2 biostrap promise ship december claim offer hr ppg sensor hrv rest spo2 respiration rate sleep analysis secondary sensor clip shoe assist exercise activity track offer api access launch ask open api access reply plan open api launch likely future customer request biostrap claim offer hr ppg sensor hrv rest respiration rate analysis sensor clip exercise activity access launch access plan launch customer request
3 post lengthy writeup recent experience replicate fast mimic diet base study publish cell metabolism fund national institute age researcher cut daily calorie half reduce biomarkers age diabetes heart disease cancer adverse effect main guideline total caloric intake kcal lb body weight protein fat... experience diet study cell metabolism institute researcher calorie biomarkers diabetes heart disease cancer effect guideline intake lb body carbs intake lb body carbs calorie calorie ketone mid mmol ketosis ketosis result body loss weight body muscle mass effect rebound effect testosterone focus...
4 personally able access series data breath tempo inhale second hold second exhale second hold second vs average breath access series data breath tempo second second second second breath
In [29]:

2 - Word frequency analysis

Word Frequency Data Viz

In [30]:
Out[30]:
Index(['id', 'user_id', 'created_at', 'cooked', 'post_number', 'updated_at',
       'reply_count', 'reply_to_post_number', 'quote_count',
       'incoming_link_count', 'reads', 'readers_count', 'topic_id',
       'reply_to_user', 'stream', 'tags', 'title', 'posts_count', 'views',
       'like_count', 'closed', 'category_id', 'word_count', 'featured_link',
       'time_read', 'likes_received', 'likes_given', 'topics_entered',
       'topic_count', 'post_count', 'posts_read', 'days_visited', 'username',
       'name', '_merge', 'first_post', 'last_post', 'lifespan',
       'lifespan_days', 'cleaned_text', 'lemmat_text', 'token_text',
       'no_sw_text', 'noHTML_text', 'lemma_LDA_text', 'no_sw_LDA_text',
       'token_NN_text', 'creation_date', 'creation_year'],
      dtype='object')
In [31]:
In [32]:
In [34]:
Out[34]:
Text(0.5, 1.0, 'TOP 30 frequent words which occurred in the posts/topics')
<Figure size 1440x720 with 0 Axes>

Word Frequency for 2020/2021

In [59]:
In [61]:
In [62]:
In [150]:
Out[150]:
word frequency
124 data 1115
154 tracking 399
1 sleep 292
309 self 292
43 app 271
362 work 210
83 qs 191
109 track 184
298 people 182
317 amp 179
14 hrv 177
113 health 172
19 rate 171
18 heart 159
191 different 154
440 days 152
127 things 139
410 new 138
576 oura 135
383 effect 135
90 project 132
82 analysis 130
821 research 129
182 blood 126
334 going 118
225 measure 117
20 body 115
713 looking 115
53 help 112
187 better 110
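A frequency table like the one above can be reproduced with a plain `collections.Counter`; toy documents are used here, since the real corpus isn't inlined:

```python
from collections import Counter

# Illustrative mini-corpus of already-cleaned documents
docs = [
    "data tracking sleep data app",
    "sleep data hrv heart rate",
    "data app work tracking",
]

# Count every token across all documents
counts = Counter(w for doc in docs for w in doc.split())
print(counts.most_common(1))   # → [('data', 4)]
```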

3 - Word cloud

In [151]:

QS Image WordCloud

In [152]:

4 - Bigrams

  • List the most frequently occurring bigrams in the posts/topics to explore them in more depth
In [153]:
Out[153]:
Text(0.5, 1.0, 'TOP 20 pair words which occurred in the topics/posts')
<Figure size 1440x720 with 0 Axes>
In [154]:
Out[154]:
Text(0.5, 1.0, 'TOP 20 three words which occurred together in the topics/posts')
<Figure size 1440x720 with 0 Axes>
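Bigrams and trigrams are simply counts over adjacent token pairs and triples; a minimal sketch with an invented token stream:

```python
from collections import Counter

tokens = "heart rate data heart rate monitor sleep data heart rate".split()

# Adjacent pairs and triples via zip over shifted views of the token list
bigrams = Counter(zip(tokens, tokens[1:]))
trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

print(bigrams.most_common(1))   # → [(('heart', 'rate'), 3)]
```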
In [ ]:

4.1 Keyword Analysis: Lexical Dispersion Plot

In [66]:
In [67]:
In [68]:
<ipython-input-68-271ac090cfdd>:1: FutureWarning: Indexing with multiple keys (implicitly converted to a tuple of keys) will be deprecated, use a list instead.
  words_df = test.groupby('creation_year')['data', 'track', 'sleep', 'work', 'app', 'device',
Out[68]:
data track sleep work app device self people help test ... change project sensor tool different activity idea apps record research
creation_year
2011 549.0 312.0 207.0 299.0 397.0 72.0 248.0 227.0 203.0 170.0 ... 149.0 97.0 66.0 87.0 71.0 61.0 103.0 48.0 81.0 94.0
2012 826.0 401.0 208.0 297.0 573.0 118.0 181.0 227.0 161.0 141.0 ... 96.0 96.0 32.0 118.0 117.0 89.0 90.0 84.0 52.0 103.0
2013 2072.0 680.0 792.0 829.0 1339.0 432.0 335.0 340.0 417.0 216.0 ... 283.0 178.0 320.0 160.0 208.0 213.0 243.0 163.0 205.0 187.0
2014 1597.0 837.0 736.0 749.0 1245.0 404.0 356.0 306.0 353.0 276.0 ... 170.0 197.0 324.0 222.0 171.0 208.0 229.0 194.0 160.0 200.0
2015 815.0 472.0 247.0 304.0 778.0 177.0 175.0 117.0 172.0 79.0 ... 106.0 70.0 74.0 91.0 65.0 67.0 82.0 125.0 74.0 93.0
2016 1080.0 470.0 566.0 360.0 821.0 253.0 133.0 161.0 170.0 211.0 ... 96.0 101.0 148.0 135.0 97.0 112.0 104.0 123.0 97.0 125.0
2017 426.0 427.0 340.0 223.0 487.0 122.0 121.0 112.0 122.0 162.0 ... 75.0 66.0 71.0 74.0 82.0 86.0 78.0 72.0 71.0 70.0
2018 426.0 282.0 203.0 202.0 395.0 121.0 78.0 102.0 111.0 170.0 ... 55.0 61.0 22.0 53.0 73.0 50.0 64.0 44.0 60.0 55.0
2019 729.0 449.0 226.0 290.0 587.0 111.0 169.0 129.0 178.0 142.0 ... 96.0 132.0 73.0 96.0 75.0 68.0 105.0 63.0 118.0 61.0
2020 909.0 570.0 195.0 356.0 636.0 127.0 257.0 140.0 154.0 138.0 ... 143.0 155.0 61.0 105.0 125.0 90.0 116.0 83.0 98.0 113.0
2021 293.0 178.0 165.0 153.0 161.0 41.0 58.0 45.0 56.0 36.0 ... 26.0 31.0 16.0 20.0 33.0 31.0 30.0 13.0 40.0 38.0

11 rows × 30 columns
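The FutureWarning above comes from indexing a groupby with several bare keys (implicitly a tuple); passing an explicit list silences it. A minimal sketch with toy data (column names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "creation_year": [2011, 2011, 2012],
    "data": [2, 3, 5],
    "track": [1, 0, 4],
})

# Pass a list of column names, not bare comma-separated keys
words_df = df.groupby("creation_year")[["data", "track"]].sum()
print(words_df)
```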

In [69]:
Out[69]:
Index(['data', 'track', 'sleep', 'work', 'app', 'device', 'self', 'people',
       'help', 'test', 'start', 'health', 'measure', 'zeo', 'post', 'new',
       'question', 'rate', 'user', 'heart', 'change', 'project', 'sensor',
       'tool', 'different', 'activity', 'idea', 'apps', 'record', 'research'],
      dtype='object')

Most frequent words in QS over the years 2011 to 2021

In [70]:
[Interactive line chart: word frequency per creation_year (2011–2021), one line per word — data, track, sleep, work, app, device, self, people, help, test, start, health, measure, zeo, post, new, question, rate, user, heart, change, project, sensor, tool, different, activity, idea, apps, record, research]


In [73]:

Merge the posts of each topic_id into a single document, yielding 2117 documents

In [ ]:
In [75]:
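Merging a thread's posts into one document is a groupby-and-join; a sketch with toy rows (values illustrative, column names from the dataframe above):

```python
import pandas as pd

posts = pd.DataFrame({
    "topic_id": [1, 1, 2],
    "no_sw_LDA_text": ["mask sensor", "strap sleep", "heart rate"],
})

# Concatenate the text of all posts sharing a topic_id
docs = posts.groupby("topic_id")["no_sw_LDA_text"].apply(" ".join).reset_index()
print(len(docs))   # one merged document per topic_id → 2
```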

5 - Topic Modelling

In [76]:
In [79]:
Out[79]:
2117
In [80]:
Out[80]:
1457
In [81]:
Out[81]:
<gensim.corpora.dictionary.Dictionary at 0x7fe98d111430>
In [82]:
CPU times: user 45.2 s, sys: 35.5 s, total: 1min 20s
Wall time: 13.6 s

LDA

In [84]:
In [85]:
In [86]:
Topic: 0 
Words: 0.034*"zeo" + 0.017*"device" + 0.012*"headband" + 0.012*"sleep" + 0.010*"app" + 0.010*"work" + 0.010*"file" + 0.009*"post" + 0.008*"sensor" + 0.008*"version"
Topic: 1 
Words: 0.012*"people" + 0.012*"tool" + 0.012*"work" + 0.011*"question" + 0.009*"device" + 0.009*"research" + 0.009*"mood" + 0.009*"idea" + 0.007*"information" + 0.007*"user"
Topic: 2 
Words: 0.015*"blood" + 0.014*"sensor" + 0.010*"device" + 0.008*"health" + 0.007*"measurement" + 0.007*"question" + 0.007*"work" + 0.007*"pressure" + 0.007*"phone" + 0.007*"temperature"
Topic: 3 
Words: 0.021*"health" + 0.012*"app" + 0.012*"device" + 0.011*"work" + 0.010*"information" + 0.008*"track" + 0.007*"post" + 0.007*"export" + 0.007*"question" + 0.007*"level"
Topic: 4 
Words: 0.016*"app" + 0.012*"health" + 0.011*"people" + 0.011*"activity" + 0.011*"file" + 0.008*"sleep" + 0.008*"zeo" + 0.007*"work" + 0.007*"tool" + 0.006*"sensor"
Topic: 5 
Words: 0.021*"device" + 0.019*"test" + 0.011*"people" + 0.011*"rate" + 0.010*"health" + 0.009*"activity" + 0.009*"blood" + 0.008*"measure" + 0.008*"work" + 0.008*"heart"
Topic: 6 
Words: 0.018*"people" + 0.017*"health" + 0.011*"work" + 0.011*"question" + 0.010*"survey" + 0.009*"self" + 0.009*"card" + 0.009*"help" + 0.008*"change" + 0.008*"research"
Topic: 7 
Words: 0.022*"heart" + 0.022*"rate" + 0.013*"app" + 0.011*"health" + 0.010*"project" + 0.009*"question" + 0.009*"people" + 0.008*"research" + 0.008*"sleep" + 0.008*"hrv"
Topic: 8 
Words: 0.020*"food" + 0.015*"people" + 0.011*"track" + 0.011*"app" + 0.010*"idea" + 0.010*"work" + 0.010*"export" + 0.009*"mood" + 0.009*"calorie" + 0.008*"question"
Topic: 9 
Words: 0.012*"temperature" + 0.011*"effect" + 0.011*"post" + 0.010*"body" + 0.009*"measure" + 0.009*"device" + 0.008*"experiment" + 0.008*"people" + 0.008*"work" + 0.007*"question"

TF-IDF

In [87]:
Topic: 0 Word: 0.009*"survey" + 0.007*"health" + 0.007*"sensor" + 0.005*"app" + 0.005*"rate" + 0.005*"fitbit" + 0.005*"device" + 0.004*"research" + 0.004*"heart" + 0.004*"people"
Topic: 1 Word: 0.005*"article" + 0.005*"community" + 0.005*"research" + 0.004*"self" + 0.004*"people" + 0.004*"app" + 0.004*"term" + 0.004*"feedback" + 0.004*"account" + 0.004*"book"
Topic: 2 Word: 0.007*"survey" + 0.005*"app" + 0.005*"project" + 0.005*"activity" + 0.005*"information" + 0.005*"test" + 0.004*"health" + 0.004*"list" + 0.004*"export" + 0.004*"pain"
Topic: 3 Word: 0.006*"people" + 0.005*"study" + 0.004*"research" + 0.004*"self" + 0.004*"work" + 0.004*"tool" + 0.004*"health" + 0.004*"device" + 0.004*"topic" + 0.004*"tracker"
Topic: 4 Word: 0.008*"health" + 0.007*"survey" + 0.006*"food" + 0.006*"research" + 0.005*"people" + 0.005*"post" + 0.005*"test" + 0.005*"question" + 0.004*"device" + 0.004*"heart"
Topic: 5 Word: 0.005*"goal" + 0.005*"app" + 0.005*"people" + 0.004*"project" + 0.004*"tool" + 0.004*"health" + 0.004*"application" + 0.004*"work" + 0.004*"device" + 0.004*"beddit"
Topic: 6 Word: 0.006*"activity" + 0.005*"device" + 0.005*"heart" + 0.005*"rate" + 0.005*"survey" + 0.004*"project" + 0.004*"analysis" + 0.004*"fitbit" + 0.004*"self" + 0.004*"example"
Topic: 7 Word: 0.006*"device" + 0.006*"heart" + 0.006*"rate" + 0.005*"test" + 0.005*"blood" + 0.004*"health" + 0.004*"pain" + 0.004*"tool" + 0.004*"temperature" + 0.004*"people"
Topic: 8 Word: 0.007*"device" + 0.007*"app" + 0.005*"mood" + 0.005*"calorie" + 0.005*"track" + 0.005*"phone" + 0.004*"withings" + 0.004*"body" + 0.004*"scale" + 0.004*"rate"
Topic: 9 Word: 0.005*"mood" + 0.005*"effect" + 0.005*"self" + 0.004*"track" + 0.004*"session" + 0.004*"question" + 0.004*"activity" + 0.004*"blood" + 0.004*"work" + 0.004*"idea"
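The TF-IDF variant reweights the bag-of-words corpus before LDA. A minimal stdlib sketch of TF-IDF, using a log2-based idf similar in spirit to gensim's `TfidfModel` default, over toy documents:

```python
import math
from collections import Counter

docs = [["heart", "rate", "rate"], ["heart", "sleep"]]
n_docs = len(docs)

# Document frequency: in how many documents does each term appear?
df = Counter(w for d in docs for w in set(d))

def tfidf(doc):
    tf = Counter(doc)
    return {w: tf[w] * math.log2(n_docs / df[w]) for w in tf}

print(tfidf(docs[0]))   # 'heart' appears in every doc, so its weight is 0
```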
In [89]:
Score: 0.9263818264007568	 
Topic: 0.015*"blood" + 0.014*"sensor" + 0.010*"device" + 0.008*"health" + 0.007*"measurement" + 0.007*"question" + 0.007*"work" + 0.007*"pressure" + 0.007*"phone" + 0.007*"temperature"

Score: 0.05621885508298874	 
Topic: 0.012*"temperature" + 0.011*"effect" + 0.011*"post" + 0.010*"body" + 0.009*"measure" + 0.009*"device" + 0.008*"experiment" + 0.008*"people" + 0.008*"work" + 0.007*"question"

Performance evaluation

In [90]:
Score: 0.46645259857177734	 
Topic: 0.006*"activity" + 0.005*"device" + 0.005*"heart" + 0.005*"rate" + 0.005*"survey" + 0.004*"project" + 0.004*"analysis" + 0.004*"fitbit" + 0.004*"self" + 0.004*"example"

Score: 0.3016797602176666	 
Topic: 0.008*"health" + 0.007*"survey" + 0.006*"food" + 0.006*"research" + 0.005*"people" + 0.005*"post" + 0.005*"test" + 0.005*"question" + 0.004*"device" + 0.004*"heart"

Score: 0.17425286769866943	 
Topic: 0.005*"mood" + 0.005*"effect" + 0.005*"self" + 0.004*"track" + 0.004*"session" + 0.004*"question" + 0.004*"activity" + 0.004*"blood" + 0.004*"work" + 0.004*"idea"

Score: 0.04456240311264992	 
Topic: 0.005*"article" + 0.005*"community" + 0.005*"research" + 0.004*"self" + 0.004*"people" + 0.004*"app" + 0.004*"term" + 0.004*"feedback" + 0.004*"account" + 0.004*"book"

Testing on unseen docs

In [91]:
Out[91]:
['Pentagon', 'deal', 'identity', 'crisis', 'Google']
In [92]:
Out[92]:
<gensim.models.ldamulticore.LdaMulticore at 0x7fe98d1117c0>
In [93]:
Score: 0.6997727155685425	 Topic: 0.018*"people" + 0.017*"health" + 0.011*"work" + 0.011*"question" + 0.010*"survey"
Score: 0.03336305171251297	 Topic: 0.021*"device" + 0.019*"test" + 0.011*"people" + 0.011*"rate" + 0.010*"health"
Score: 0.033360593020915985	 Topic: 0.012*"temperature" + 0.011*"effect" + 0.011*"post" + 0.010*"body" + 0.009*"measure"
Score: 0.03336052969098091	 Topic: 0.021*"health" + 0.012*"app" + 0.012*"device" + 0.011*"work" + 0.010*"information"
Score: 0.03335874155163765	 Topic: 0.034*"zeo" + 0.017*"device" + 0.012*"headband" + 0.012*"sleep" + 0.010*"app"
Score: 0.03335772082209587	 Topic: 0.022*"heart" + 0.022*"rate" + 0.013*"app" + 0.011*"health" + 0.010*"project"
Score: 0.033357229083776474	 Topic: 0.012*"people" + 0.012*"tool" + 0.012*"work" + 0.011*"question" + 0.009*"device"
Score: 0.03335684537887573	 Topic: 0.020*"food" + 0.015*"people" + 0.011*"track" + 0.011*"app" + 0.010*"idea"
Score: 0.033356454223394394	 Topic: 0.016*"app" + 0.012*"health" + 0.011*"people" + 0.011*"activity" + 0.011*"file"
Score: 0.03335611894726753	 Topic: 0.015*"blood" + 0.014*"sensor" + 0.010*"device" + 0.008*"health" + 0.007*"measurement"

Topic model for the docs

In [94]:
Out[94]:
[pyLDAvis view: Intertopic Distance Map (via multidimensional scaling) for topics 1–10, with relevance slider λ and the Top-30 Most Salient Terms — zeo, health, heart, rate, blood, sensor, food, app, temperature, headband, test, device, sleep, survey, file, body, card, measure, activity, export, effect, level, air, calorie, loss, information, pressure, post, apple, experiment]
In [95]:

Vectorized LDA Model

  • data_vectorized

In [98]:
In [99]:
Sparsity:  2.41582773017756 %
In [100]:

LDA Model with sklearn

In [101]:
LatentDirichletAllocation(learning_method='online', n_components=20, n_jobs=-1,
                          random_state=100)
In [102]:
In [103]:
Out[103]:
GridSearchCV(estimator=LatentDirichletAllocation(),
             param_grid={'learning_decay': [0.5, 0.7, 0.9],
                         'n_components': [10, 15, 20, 25, 30]})
In [104]:
Best Model's Params:  {'learning_decay': 0.9, 'n_components': 10}
Best Log Likelihood Score:  -283303.8335662069
Model Perplexity:  684.6614381114504
In [105]:
In [106]:
In [107]:
Out[107]:
Index(['topic_id', 'no_sw_LDA_text', 'token_NN_text'], dtype='object')
In [108]:
Out[108]:
Topic0 Topic1 Topic2 Topic3 Topic4 Topic5 Topic6 Topic7 Topic8 Topic9 dominant_topic
Doc0 0.000000 0.000000 0.000000 0.340000 0.000000 0.000000 0.630000 0.000000 0.000000 0.000000 6
Doc1 0.140000 0.140000 0.000000 0.220000 0.000000 0.000000 0.000000 0.200000 0.180000 0.100000 3
Doc2 0.000000 0.000000 0.000000 0.100000 0.610000 0.000000 0.290000 0.000000 0.000000 0.000000 4
Doc3 0.000000 0.000000 0.210000 0.170000 0.210000 0.000000 0.120000 0.000000 0.140000 0.150000 2
Doc4 0.000000 0.560000 0.000000 0.000000 0.030000 0.300000 0.100000 0.000000 0.000000 0.000000 1
Doc5 0.000000 0.000000 0.180000 0.000000 0.000000 0.000000 0.000000 0.000000 0.750000 0.070000 8
Doc6 0.120000 0.000000 0.000000 0.110000 0.000000 0.050000 0.000000 0.000000 0.640000 0.080000 8
Doc7 0.000000 0.000000 0.000000 0.210000 0.130000 0.120000 0.370000 0.100000 0.060000 0.000000 6
Doc8 0.650000 0.000000 0.020000 0.000000 0.000000 0.000000 0.000000 0.070000 0.060000 0.200000 0
Doc9 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.010000 0.920000 0.010000 0.010000 7
Doc10 0.220000 0.290000 0.000000 0.040000 0.180000 0.000000 0.000000 0.000000 0.160000 0.110000 1
Doc11 0.000000 0.000000 0.000000 0.220000 0.000000 0.170000 0.000000 0.590000 0.000000 0.000000 7
Doc12 0.000000 0.000000 0.000000 0.040000 0.000000 0.390000 0.010000 0.000000 0.560000 0.000000 8
Doc13 0.000000 0.100000 0.040000 0.220000 0.000000 0.000000 0.000000 0.630000 0.000000 0.000000 7
Doc14 0.000000 0.000000 0.110000 0.000000 0.000000 0.000000 0.220000 0.440000 0.210000 0.000000 7
In [109]:
Out[109]:
Topic Num Num Documents
8 0 127
3 1 219
7 2 133
4 3 202
5 4 163
0 5 380
9 6 107
1 7 348
6 8 159
2 9 279
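The `dominant_topic` column in Out[108] is just the argmax of each row of the document–topic matrix; a sketch with illustrative probabilities standing in for `lda_model.transform(...)` output:

```python
# Each row: P(topic | document) from a fitted LDA (values illustrative)
doc_topic = [
    [0.00, 0.34, 0.63, 0.03],
    [0.10, 0.05, 0.05, 0.80],
]

# Index of the largest probability in each row
dominant = [max(range(len(row)), key=row.__getitem__) for row in doc_topic]
print(dominant)   # → [2, 3]
```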

Coherent LDA model

In [110]:
Out[110]:
[pyLDAvis view: Intertopic Distance Map (via multidimensional scaling) for topics 1–10, with relevance slider λ and the Top-30 Most Salient Terms — data, food, zeo, heart, health, rate, test, mood, blood, effect, file, sensor, hrv, device, survey, doctor, result, people, body, calorie, study, tracker, brain, measure, habit, tool, meditation, headband, temperature, air]
In [111]:
Out[111]:
ability abramson absence absolute acceleration accelerometer access account accuracy accurate ... yield yoga york youtube zenobase zeo zephyr zero zone zoom
Topic0 18.809014 0.100008 0.100064 0.100031 5.242336 0.100005 3.602480 1.272275 0.148912 0.100022 ... 0.100032 0.100010 0.100036 0.100031 0.100010 0.100006 0.100000 0.100014 6.049161 0.100021
Topic1 9.905187 0.100010 3.588545 0.100014 0.100007 3.229220 126.798366 66.859274 3.155948 5.695091 ... 4.471704 0.100039 0.100009 3.837936 8.944944 0.100010 0.100003 0.100023 3.854368 0.100009
Topic2 11.009566 13.956326 1.400413 0.100024 0.100032 0.100045 12.037653 0.100019 0.100039 0.100057 ... 0.111906 0.100006 0.100022 1.986120 0.100016 0.100010 0.100010 0.100002 0.100017 0.100005
Topic3 0.100022 0.100011 0.100008 0.100004 6.475913 0.100009 7.218509 8.359815 0.100015 1.941557 ... 2.464022 9.755616 0.100005 0.100014 0.100006 0.100002 0.100002 0.100001 0.100016 4.524078
Topic4 30.752198 0.100001 3.306644 1.024736 0.263682 16.743366 205.728604 10.721505 11.320647 3.867137 ... 1.790313 0.100006 0.100010 5.477858 0.102306 1090.044078 4.555980 10.987226 5.172512 1.131864

5 rows × 1985 columns

Top 15 keywords in each topic

In [112]:
Out[112]:
Word 0 Word 1 Word 2 Word 3 Word 4 Word 5 Word 6 Word 7 Word 8 Word 9 Word 10 Word 11 Word 12 Word 13 Word 14
Topic 0 mood people data card question idea point measure track scale emotion quote entry problem image
Topic 1 data people health work information analysis question tool self term service share research company result
Topic 2 health test blood doctor air sensor level device product quality home company consumer people lab
Topic 3 health tracker people tool goal activity fitness technology survey self step research conference fitbit university
Topic 4 data zeo app file sleep version headband work night post phone battery export unit information
Topic 5 data device app user tool service export activity api track software source work feature access
Topic 6 test effect result study survey brain meditation experiment question measure intervention pill score mind headache
Topic 7 people project question habit research idea book self community change mood experience help work post
Topic 8 food body calorie blood water effect meal eat track weight exercise sugar loss muscle level
Topic 9 heart rate data device sensor hrv measure temperature measurement monitor sleep stress activity watch night
In [113]:
In [114]:
In [115]:
['data', 'zeo', 'app', 'file', 'sleep', 'version', 'headband', 'work', 'night', 'post', 'phone', 'battery', 'export', 'unit', 'information']

Cluster documents with similar topics

In [116]:
Component's weights: 
 [[ 0.15  0.27  0.15  0.26  0.2   0.57  0.11  0.57  0.15  0.31]
 [ 0.03  0.04  0.02  0.09 -0.13 -0.61  0.    0.73 -0.03 -0.24]]
Perc of Variance Explained: 
 [0.03 0.2 ]
In [117]:
Out[117]:
Text(0.5, 1.0, 'Segregation of Topic Clusters')

Find similar documents by entering text + create a dashboard

In [118]:
Out[118]:
Index(['id', 'user_id', 'created_at', 'cooked', 'post_number', 'updated_at',
       'reply_count', 'reply_to_post_number', 'quote_count',
       'incoming_link_count', 'reads', 'readers_count', 'topic_id',
       'reply_to_user', 'stream', 'tags', 'title', 'posts_count', 'views',
       'like_count', 'closed', 'category_id', 'word_count', 'featured_link',
       'time_read', 'likes_received', 'likes_given', 'topics_entered',
       'topic_count', 'post_count', 'posts_read', 'days_visited', 'username',
       'name', '_merge', 'first_post', 'last_post', 'lifespan',
       'lifespan_days', 'cleaned_text', 'lemmat_text', 'token_text',
       'no_sw_text', 'noHTML_text', 'lemma_LDA_text', 'no_sw_LDA_text',
       'token_NN_text', 'creation_date', 'creation_year'],
      dtype='object')
In [119]:
In [123]:
Topic KeyWords:  ['health', 'tracker', 'people', 'tool', 'goal', 'activity', 'fitness', 'technology', 'survey', 'self', 'step', 'research', 'conference', 'fitbit', 'university']
Topic Prob Scores of text:  [[0.  0.  0.  0.8 0.  0.  0.  0.  0.  0. ]]
Most Similar Doc's Probs:   [[0.  0.  0.  0.8 0.  0.  0.  0.  0.  0. ]]

 757    microbiome sequenced multiple times changing diet traveling interesting educational expect major insights data limited
Name: cleaned_text, dtype: object
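The similarity function itself isn't shown; one common choice is cosine similarity between the query's topic-probability vector and each document's. A sketch with invented vectors (doc id 757 from the output above is reused purely for illustration):

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

query = [0.0, 0.0, 0.8, 0.2]                      # topic probs of the query text
docs = {757: [0.0, 0.0, 0.8, 0.2],                # hypothetical doc-topic vectors
        12: [0.5, 0.5, 0.0, 0.0]}

best = max(docs, key=lambda k: cosine(query, docs[k]))
print(best)   # → 757
```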

Dominant topic in each sentence

In [126]:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-126-abfb6d90f50a> in <module>
     22 
     23 
---> 24 df_topic_sents_keywords = format_topics_sentences(ldamodel= best_lda_model, corpus= corpus, texts= data_vectorized)
     25 
     26 # Format

<ipython-input-126-abfb6d90f50a> in format_topics_sentences(ldamodel, corpus, texts)
      4 
      5     # Get main topic in each document
----> 6     for i, row in enumerate(ldamodel[corpus]):
      7         row = sorted(row, key=lambda x: (x[1]), reverse=True)
      8         # Get the Dominant topic, Perc Contribution and Keywords for each document

TypeError: 'LatentDirichletAllocation' object is not subscriptable
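The TypeError arises because `ldamodel[corpus]` is gensim's indexing API, while `best_lda_model` is sklearn's `LatentDirichletAllocation`, which exposes `.transform(X)` instead. Once transformed, the per-document loop runs over plain rows. A sketch with an illustrative matrix standing in for `best_lda_model.transform(data_vectorized)`:

```python
# Stand-in for best_lda_model.transform(data_vectorized)
doc_topic = [
    [0.05, 0.90, 0.05],
    [0.70, 0.10, 0.20],
]

for i, row in enumerate(doc_topic):
    # Rank (topic_num, proportion) pairs by proportion, descending
    ranked = sorted(enumerate(row), key=lambda x: x[1], reverse=True)
    topic_num, prop = ranked[0]
    print(i, topic_num, round(prop, 2))
```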

In [127]:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-127-82fd381f62dc> in <module>
      2 sent_topics_sorteddf_mallet = pd.DataFrame()
      3 
----> 4 sent_topics_outdf_grpd = df_topic_sents_keywords.groupby('Dominant_Topic')
      5 
      6 for i, grp in sent_topics_outdf_grpd:

NameError: name 'df_topic_sents_keywords' is not defined

Topic distribution across posts/topics

In [128]:
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-128-d0e6a0741df8> in <module>
      1 # Number of Documents for Each Topic
----> 2 topic_counts = df_topic_sents_keywords['Dominant_Topic'].value_counts()
      3 
      4 # Percentage of Documents for Each Topic
      5 topic_contribution = round(topic_counts/topic_counts.sum(), 4)

NameError: name 'df_topic_sents_keywords' is not defined

In [129]:
Out[129]:
10282


Named Entity Recognition

In [130]:
Out[130]:
'conducted on Cortext platform'
In [131]:
In [132]:
In [133]:
19
37
In [134]:
In [135]:
In [138]:
Out[138]:
entity frequency category
0 QS 2150 ORG
1 HRV 917 ORG
2 Zeo 706 ORG
3 iPhone 393 ORG
4 EEG 333 ORG
... ... ... ...
75 PDF 29 ORG
76 PPG 29 ORG
77 Nexus 28 ORG
78 WHIB 28 ORG
79 PS 28 ORG

80 rows × 3 columns
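The table above is built from NER output (e.g. spaCy's, which labelled most entities ORG); the model itself can't run here, so this sketch counts already-extracted `(entity, label)` pairs — the pair list is hypothetical:

```python
from collections import Counter

# (entity_text, label) pairs as an NER pipeline such as spaCy would emit
ents = [("QS", "ORG"), ("Zeo", "ORG"), ("QS", "ORG"), ("iPhone", "ORG")]

# Aggregate into (entity, frequency, category) rows, most frequent first
freq = Counter(ents)
rows = [(e, n, lbl) for (e, lbl), n in freq.most_common()]
print(rows[0])   # → ('QS', 2, 'ORG')
```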

In [139]:
In [140]:
Out[140]:
<bound method Series.items of 0         QS
1        HRV
2        Zeo
3     iPhone
4        EEG
       ...  
75       PDF
76       PPG
77     Nexus
78      WHIB
79        PS
Name: entity, Length: 80, dtype: object>
In [141]:
In [148]:
In [143]:
In [144]:
In [145]:
In [147]:
[Bar chart 'Bigram similarity and frequency': entity frequency (up to ~2000) — QS, HRV, Zeo, iPhone, EEG, Quantified Self, Android, Google, BodyMedia, Apple, Oura, Amazon, RescueTime, HealthKit, MyFitnessPal, Fitbit, Jawbone, Apple Watch, and the other detected entities listed above]